


An Honest Cross-Validation Estimator for Prediction Performance

Pan, Tianyu, Yu, Vincent Z., Devanarayan, Viswanath, Tian, Lu

arXiv.org Machine Learning

Cross-validation is a standard tool for obtaining an honest assessment of the performance of a prediction model. The commonly used version repeatedly splits the data, trains the prediction model on the training set, evaluates the model performance on the test set, and averages the model performance across different data splits. A well-known criticism is that such a cross-validation procedure does not directly estimate the performance of the particular model recommended for future use. In this paper, we propose a new method to estimate the performance of a model trained on a specific (random) training set. A naive estimator can be obtained by applying the model to a disjoint testing set. Surprisingly, cross-validation estimators computed from other random splits can be used to improve this naive estimator within a random-effects model framework. We develop two estimators -- a hierarchical Bayesian estimator and an empirical Bayes estimator -- that perform similarly to or better than both the conventional cross-validation estimator and the naive single-split estimator. Simulations and a real-data example demonstrate the superior performance of the proposed method.
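The random-effects idea above can be illustrated with a simple shrinkage sketch: the naive single-split estimate is pulled toward the mean of estimates from other random splits, with a weight driven by the between-split variability. This is an illustrative James-Stein-style sketch, not the authors' hierarchical Bayesian or empirical Bayes estimator; the equal within- and between-split variance assumption is mine.

```python
import numpy as np

def shrinkage_estimate(naive, other_splits):
    """Shrink a single-split performance estimate toward the mean of
    estimates from other random splits (random-effects model idea).

    naive: performance estimate for the model trained on the split of interest
    other_splits: performance estimates from other random data splits
    """
    other = np.asarray(other_splits, dtype=float)
    mu = other.mean()            # grand mean across splits
    tau2 = other.var(ddof=1)     # between-split variability
    sigma2 = tau2                # assume within-split noise of the same scale (illustrative)
    w = tau2 / (tau2 + sigma2)   # shrinkage weight in [0, 1]
    return w * naive + (1 - w) * mu
```

With equal variance components the weight is 0.5, so the estimate is the midpoint between the naive value and the mean over the other splits.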



BenchMake: Turn any scientific data set into a reproducible benchmark

Barnard, Amanda S

arXiv.org Artificial Intelligence

Benchmark data sets are curated collections that enable consistent, reproducible, and objective evaluation of algorithms and models [1, 2]. They are essential for comparing algorithm performance fairly, particularly in machine learning (ML) and artificial intelligence (AI), where the suitability of algorithms can vary widely based on data structure, dimensionality, and distribution [3, 4]. For instance, algorithms that perform exceptionally on structured, tabular data may not generalise well to unstructured image or textual data [5]. Established benchmarks such as ImageNet [2], CIFAR data sets [6], and OpenML benchmarks for structured data [7] have driven innovation by providing clear metrics for progress, fostering reproducibility and trust within the research community [8]. However, in computational sciences, standardised benchmarks remain rare and challenging to establish due to the intrinsic complexity, heterogeneity, and domain specificity of scientific data [9]. Scientific data sets can be represented in a variety of ways (tables, images, text, graphs, signals), often require extensive pre-processing and specialised evaluation metrics, and are subject to measurement noise, natural variability, and data imbalance [10].


TRIDIS: A Comprehensive Medieval and Early Modern Corpus for HTR and NER

Aguilar, Sergio Torres

arXiv.org Artificial Intelligence

This paper introduces TRIDIS (Tria Digita Scribunt), an open-source corpus of medieval and early modern manuscripts. TRIDIS aggregates multiple legacy collections (all published under open licenses) and incorporates large metadata descriptions. While prior publications referenced some portions of this corpus, here we provide a unified overview with a stronger focus on its constitution. We describe (i) the narrative, chronological, and editorial background of each major sub-corpus, (ii) its semi-diplomatic transcription rules (expansion, normalization, punctuation), (iii) a strategy for challenging out-of-domain test splits driven by outlier detection in a joint embedding space, and (iv) preliminary baseline experiments using TrOCR and MiniCPM-Llama3-V 2.5 comparing random and outlier-based test partitions. Overall, TRIDIS is designed to stimulate joint robust Handwritten Text Recognition (HTR) and Named Entity Recognition (NER) research across medieval and early modern textual heritage.
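The outlier-driven split in (iii) can be sketched as follows: embed each document, then hold out the samples farthest from the embedding centroid as an out-of-domain test set. This is a minimal sketch assuming centroid distance as the outlier score; the paper's actual outlier-detection method in the joint embedding space may differ.

```python
import numpy as np

def outlier_test_split(embeddings, test_frac=0.2):
    """Pick the most atypical samples (farthest from the centroid of a
    joint embedding space) as an out-of-domain test set.
    Returns (train_idx, test_idx) as index arrays."""
    X = np.asarray(embeddings, dtype=float)
    centroid = X.mean(axis=0)
    dists = np.linalg.norm(X - centroid, axis=1)  # distance of each sample to the centroid
    n_test = max(1, int(round(test_frac * len(X))))
    order = np.argsort(dists)                     # ascending distance
    test_idx = order[-n_test:]                    # farthest samples become the test set
    train_idx = order[:-n_test]
    return train_idx, test_idx
```

Compared with a random split, this deliberately makes the test set harder, which is the point of the random-vs-outlier comparison in (iv).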


Understanding the Limits of Deep Tabular Methods with Temporal Shift

Cai, Hao-Run, Ye, Han-Jia

arXiv.org Artificial Intelligence

Deep tabular models have demonstrated remarkable success on i.i.d. data, excelling in a variety of structured data tasks. However, their performance often deteriorates under temporal distribution shifts, where trends and periodic patterns are present in the evolving data distribution over time. In this paper, we explore the underlying reasons for this failure in capturing temporal dependencies. We begin by investigating the training protocol, revealing a key issue in how model selection is performed. While existing approaches use temporal ordering to split the validation set, we show that even a random split can significantly improve model performance. By minimizing the time lag between training data and test time, while reducing the bias in validation, our proposed training protocol significantly improves generalization across various methods. Furthermore, we analyze how temporal data affects deep tabular representations, uncovering that these models often fail to capture crucial periodic and trend information. To address this gap, we introduce a plug-and-play temporal embedding method based on Fourier series expansion to learn and incorporate temporal patterns, offering an adaptive approach to handle temporal shifts. Our experiments demonstrate that this temporal embedding, combined with the improved training protocol, provides a more effective and robust framework for learning from temporal tabular data.
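A Fourier-series temporal embedding of the kind described above can be sketched as a fixed feature map: each timestamp is expanded into sine/cosine pairs of increasing frequency, which a tabular model can then consume as extra columns. This is a generic sketch; the paper's plug-and-play embedding learns its parameters, and the `period` here is an assumed dominant cycle length.

```python
import numpy as np

def fourier_time_features(t, period, n_terms=3):
    """Expand timestamps into sin/cos features of increasing frequency
    so a tabular model can pick up periodic structure.

    t: array of timestamps; period: assumed dominant cycle length.
    Returns an array of shape (len(t), 2 * n_terms)."""
    t = np.asarray(t, dtype=float)
    feats = []
    for k in range(1, n_terms + 1):
        angle = 2 * np.pi * k * t / period
        feats.append(np.sin(angle))   # odd harmonic component
        feats.append(np.cos(angle))   # even harmonic component
    return np.stack(feats, axis=-1)
```

The resulting columns would typically be concatenated with the original tabular features before training.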


Excited-state nonadiabatic dynamics in explicit solvent using machine learned interatomic potentials

Tiefenbacher, Maximilian X., Bachmair, Brigitta, Chen, Cheng Giuseppe, Westermayr, Julia, Marquetand, Philipp, Dietschreit, Johannes C. B., González, Leticia

arXiv.org Artificial Intelligence

Excited-state nonadiabatic simulations with quantum mechanics/molecular mechanics (QM/MM) are essential to understand photoinduced processes in explicit environments. However, the high computational cost of the underlying quantum chemical calculations limits their application in combination with trajectory surface hopping methods. Here, we use FieldSchNet, a machine-learned interatomic potential capable of incorporating electric field effects into the electronic states, to replace traditional QM/MM electrostatic embedding with its ML/MM counterpart for nonadiabatic excited-state trajectories. The developed method is applied to furan in water, including five coupled singlet states. Our results demonstrate that with sufficiently curated training data, the ML/MM model reproduces the electronic kinetics and structural rearrangements of QM/MM surface hopping reference simulations. Furthermore, we identify performance metrics that provide robust and interpretable validation of model accuracy.


Enhancing Drug-Target Interaction Prediction through Transfer Learning from Activity Cliff Prediction Tasks

Ibragimova, Regina, Iliadis, Dimitrios, Waegeman, Willem

arXiv.org Artificial Intelligence

Recently, machine learning (ML) has gained popularity in the early stages of drug discovery. This trend is unsurprising given the increasing volume of relevant experimental data and the continuous improvement of ML algorithms. However, conventional models, which rely on the principle of molecular similarity, often fail to capture the complexities of chemical interactions, particularly those involving activity cliffs (ACs), compounds that are structurally similar but exhibit markedly different activities. In this work, we address two distinct yet related tasks: (1) activity cliff (AC) prediction and (2) drug-target interaction (DTI) prediction. Leveraging insights gained from the AC prediction task, we aim to improve the performance of DTI prediction through transfer learning. A universal model was developed for AC prediction, capable of identifying activity cliffs across diverse targets. Insights from this model were then incorporated into DTI prediction, enabling better handling of challenging cases involving ACs while maintaining similar overall performance. This approach establishes a strong foundation for integrating AC awareness into predictive models for drug discovery.

Scientific Contribution: This study presents a novel approach that applies transfer learning from AC prediction to enhance DTI prediction, addressing limitations of traditional similarity-based models. By introducing AC-awareness, we improve DTI model performance in structurally complex regions, demonstrating the benefits of integrating compound-specific and protein-contextual information. Unlike previous studies, which treat AC and DTI predictions as separate problems, this work establishes a unified framework to address both data scarcity and prediction challenges in drug discovery.
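The transfer-learning step described above, reusing representations learned on the AC task to warm-start the DTI model, can be sketched at the parameter level. The plain-dict parameter format, the `encoder.` prefix, and the `transfer_encoder` helper are all assumptions for illustration; the paper's actual architectures and transfer mechanism are not specified here.

```python
def transfer_encoder(ac_params, dti_params, prefix="encoder."):
    """Copy encoder parameters learned on the AC task into a fresh DTI
    model, leaving the DTI-specific head untouched.

    Parameters are plain {name: value} dicts (assumption); any name
    starting with `prefix` is treated as part of the shared encoder."""
    transferred = dict(dti_params)        # start from the fresh DTI model
    for name, value in ac_params.items():
        if name.startswith(prefix):       # overwrite only shared encoder weights
            transferred[name] = value
    return transferred
```

In a real pipeline the copied encoder would usually be either frozen or fine-tuned at a reduced learning rate on the DTI objective.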


TabReD: A Benchmark of Tabular Machine Learning in-the-Wild

Rubachev, Ivan, Kartashev, Nikolay, Gorishniy, Yury, Babenko, Artem

arXiv.org Artificial Intelligence

Benchmarks that closely reflect downstream application scenarios are essential for the streamlined adoption of new research in tabular machine learning (ML). In this work, we examine existing tabular benchmarks and find two common characteristics of industry-grade tabular data that are underrepresented in the datasets available to the academic community. First, tabular data often changes over time in real-world deployment scenarios. This impacts model performance and requires time-based train and test splits for correct model evaluation. Yet, existing academic tabular datasets often lack timestamp metadata to enable such evaluation. Second, a considerable portion of datasets in production settings stem from extensive data acquisition and feature engineering pipelines. For each specific dataset, this can have a different impact on the absolute and relative number of predictive, uninformative, and correlated features, which in turn can affect model selection. To fill the aforementioned gaps in academic benchmarks, we introduce TabReD -- a collection of eight industry-grade tabular datasets covering a wide range of domains from finance to food delivery services. We assess a large number of tabular ML models in the feature-rich, temporally-evolving data setting facilitated by TabReD. We demonstrate that evaluation on time-based data splits leads to a different ranking of methods compared to evaluation on the random splits more common in academic benchmarks. Furthermore, on the TabReD datasets, MLP-like architectures and GBDT models show the best results, while more sophisticated DL models are yet to prove their effectiveness.
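The time-based split that the abstract contrasts with random splitting can be sketched directly: sort by timestamp and hold out the latest fraction, so the model is trained strictly on the past and evaluated on the future. A minimal sketch, assuming per-row timestamps are available:

```python
import numpy as np

def time_based_split(timestamps, test_frac=0.2):
    """Split row indices so the test set is strictly later in time than
    the training set, mimicking deployment (train on past, predict future).
    Returns (train_idx, test_idx) as index arrays."""
    ts = np.asarray(timestamps)
    order = np.argsort(ts, kind="stable")   # chronological order of rows
    n_test = max(1, int(round(test_frac * len(ts))))
    train_idx = order[:-n_test]             # earliest rows
    test_idx = order[-n_test:]              # latest rows held out for evaluation
    return train_idx, test_idx
```

Swapping this in for a uniform random split is exactly the evaluation change that TabReD reports can reorder method rankings.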


EUROPA: A Legal Multilingual Keyphrase Generation Dataset

Salaün, Olivier, Piedboeuf, Frédéric, Berre, Guillaume Le, Hermelo, David Alfonso, Langlais, Philippe

arXiv.org Artificial Intelligence

Keyphrase generation has primarily been explored within the context of academic research articles, with a particular focus on scientific domains and the English language. In this work, we present EUROPA, a dataset for multilingual keyphrase generation in the legal domain. It is derived from legal judgments from the Court of Justice of the European Union (EU), and contains instances in all 24 EU official languages. We run multilingual models on our corpus and analyze the results, showing room for improvement on a domain-specific multilingual corpus such as the one we present.